Abstract

We present a system that produces sentential de- 
scriptions of video: who did what to whom, 
and where and how they did it. Action class is 
rendered as a verb, participant objects as noun 
phrases, properties of those objects as adjecti- 
val modifiers in those noun phrases, spatial re- 
lations between those participants as preposi- 
tional phrases, and characteristics of the event 
as prepositional-phrase adjuncts and adverbial 
modifiers. Extracting the information needed to 
render these linguistic entities requires an ap- 
proach to event recognition that recovers object 
tracks, the track-to-role assignments, and chang- 
ing body posture. 



1 Introduction 

We present a system that produces sentential descriptions 
of short video clips. These sentences describe who did 
what to whom, and where and how they did it. This sys- 
tem not only describes the observed action as a verb, it also 
describes the participant objects as noun phrases, proper- 
ties of those objects as adjectival modifiers in those noun 
phrases, the spatial relations between those participants as 

* Corresponding author. Email: andrei@0xab.com.

Additional images and videos as well as all code and 
coordination: and

verbs: approached, arrived, attached, bounced, buried, carried, caught,
chased, closed, collided, digging, dropped, entered, exchanged,
exited, fell, fled, flew, followed, gave, got, had, handed, hauled, held,
hit, jumped, kicked, left, lifted, moved, opened, passed, picked,
pushed, put, raised, ran, received, replaced, snatched, stopped,
threw, took, touched, turned, walked, went

nouns: bag, ball, bench, bicycle, box, cage, car, cart, chair, dog, door,
ladder, left, mailbox, microwave, motorcycle, object, person, right,
skateboard, SUV, table, tripod, truck

adjectives: big, black, blue, cardboard, crouched, green, narrow, other, pink,
prone, red, short, small, tall, teal, toy, upright, white, wide, yellow

prepositions: above, because, below, from, of, over, to, with

lexical PPs: downward, leftward, rightward, upward

determiners: an, some, that, the

particles: away, down, up

pronouns: itself, something, themselves

adverbs: quickly, slowly

auxiliary: was

Table 1: The vocabulary used to generate sentential de- 
scriptions of video. 

prepositional phrases, and characteristics of the event as 
prepositional-phrase adjuncts and adverbial modifiers. It 
incorporates a vocabulary of 118 words: 1 coordination, 
48 verbs, 24 nouns, 20 adjectives, 8 prepositions, 4 lexical
prepositional phrases, 4 determiners, 3 particles, 3 pronouns,
2 adverbs, and 1 auxiliary, as listed in Table 1.

Production of sentential descriptions requires recognizing 
the primary action being performed, because such actions 
are rendered as verbs and verbs serve as the central scaf- 
folding for sentences. However, event recognition alone 
is insufficient to generate the remaining sentential compo- 
nents. One must recognize object classes in order to render 
nouns. But even object recognition alone is insufficient to 
generate meaningful sentences. One must determine the 
roles that such objects play in the event. The agent, i.e. the 
doer of the action, is typically rendered as the sentential
subject while the patient, i.e. the affected object, is typi- 
cally rendered as the direct object. Detected objects that do 
not play a role in the observed event, no matter how promi- 
nent, should not be incorporated into the description. This 
means that one cannot use common approaches to event 



recognition, such as spatiotemporal bags of words (Laptev
et al., 2007; Niebles et al., 2008; Scovanner et al., 2007),
spatiotemporal volumes ([. . .] 2008; Rodriguez et al., 2008),
and tracked feature points (Liu et al., 2009; Schuldt et al.,
2004; Wang and Mori, 2009) that do not determine the class
of participant objects and the roles that they play. Even
combining such
approaches with an object detector would likely detect ob- 
jects that don't participate in the event and wouldn't be able 
to determine the roles that any detected objects play. 

Producing elaborate sentential descriptions requires more 
than just event recognition and object detection. Generat- 
ing a noun phrase with an embedded prepositional phrase, 
such as the person to the left of the bicycle, requires deter- 
mining spatial relations between detected objects, as well 
as knowing which of the two detected objects plays a role 
in the overall event and which serves just to aid genera- 
tion of a referring expression to help identify the event par- 
ticipant. Generating a noun phrase with adjectival modi- 
fiers, such as the red ball, not only requires determining 
the properties, such as color, shape, and size, of the ob- 
served objects, but also requires determining whether such 
descriptions are necessary to help disambiguate the refer- 
ent of a noun phrase. It would be awkward to generate a 
noun phrase such as the big tall wide red toy cardboard 
trash can when the trash can would suffice. Moreover, one 
must track the participants to determine the speed and di- 
rection of their motion to generate adverbs such as slowly 
and prepositional phrases such as leftward. Further, one 
must track the identity of multiple instances of the same 
object class to appropriately generate the distinction be- 
tween Some person hit some other person and The person 
hit themselves. 



A common assumption in Linguistics (Jackendoff, 1983;
Pinker, 1989) is that verbs typically characterize the
interaction between event participants in terms of the gross
changing motion of these participants. Object class and 
image characteristics of the participants are believed to be 
largely irrelevant to determining the appropriate verb la- 
bel for an action class. Participants simply fill roles in the 
spatiotemporal structure of the action class described by 
a verb. For example, an event where one participant (the 
agent) picks up another participant (the patient) consists of 
a sequence of two sub-events, where during the first sub- 
event the agent moves towards the patient while the patient 
is at rest and during the second sub-event the agent moves 
together with the patient away from the original location 
of the patient. While determining whether the agent is a 



person or a cat, and whether the patient is a ball or a cup, 
is necessary to generate the noun phrases incorporated into 
the sentential description, such information is largely irrel- 
evant to determining the verb describing the action. Simi- 
larly, while determining the shapes, sizes, colors, textures, 
etc. of the participants is necessary to generate adjectival 
modifiers, such information is also largely irrelevant to de- 
termining the verb. Common approaches to event recog- 
nition, such as spatiotemporal bags of words, spatiotem- 
poral volumes, and tracked feature points, often achieve 
high accuracy because of correlation with image or video 
properties exhibited by a particular corpus. These are often
artefactual, not defining properties of the verb meaning
(e.g. recognizing diving by correlation with blue since it
'happens in a pool' (Liu et al., 2009, p. 2002) or confusing
basketball and volleyball 'because most of the time the [. . .]
sports use very similar courts' (Ikizler-Cinbis and Sclaroff,
2010, p. 506)).



2 The Mind's Eye corpus

Many existing video corpora used to evaluate event 
recognition are ill-suited for evaluating sentential
descriptions. For example, the Weizmann dataset (Blank
et al., 2005) and the KTH dataset (Schuldt et al., 2004)
depict events with a single human participant, not ones
where people interact with other people or objects.
For these datasets, the sentential descriptions would
contain no information other than the verb, e.g. The
person jumped. Moreover, such datasets, as well as the
Sports Actions dataset (Rodriguez et al., 2008) and
the YouTube dataset (Liu et al., 2009), often make
action-class distinctions that are irrelevant to the choice
of verb, e.g. wave1 vs. wave2, jump vs. pjump,
Golf-Swing-Back vs. Golf-Swing-Front
vs. Golf-Swing-Side, Kicking-Front
vs. Kicking-Side, Swing-Bench vs.
Swing-SideAngle, and golf_swing vs.
tennis_swing vs. swing. Other datasets, such as
the Ballet dataset (Wang and Mori, 2009) and the
UCF50 dataset (Liu et al., 2009), depict larger-scale
activities that bear activity-class names that are not well
suited to sentential description, e.g. Basketball, Billiards,
BreastStroke, CleanAndJerk, HorseRace,
HulaHoop, MilitaryParade, TaiChi, and YoYo.

The year-one (Y1) corpus produced by DARPA for the
Mind's Eye program, however, was specifically designed
to evaluate sentential description. This corpus contains two
parts: the development corpus, C-D1, which we use solely
for training, and the evaluation corpus, C-E1, which we
use solely for testing. Each of the above is further divided
into four sections to support the four task goals of
the Mind's Eye program, namely recognition, description,
gap filling, and anomaly detection. In this paper, we use
only the recognition and description portions and apply our
entire sentential-description pipeline to the combination of
these portions. While portions of C-E1 overlap with C-D1,
in this paper we train our methods solely on C-D1 and test
our methods solely on the portion of C-E1 that does not
overlap with C-D1.

Moreover, a portion of the corpus was synthetically gener- 
ated by a variety of means: computer graphics driven by 
motion capture, pasting foregrounds extracted from green 
screening onto different backgrounds, and intensity vari- 
ation introduced by postprocessing. In this paper, we 
exclude all such synthetic video from our test corpus. 
Our training set contains 3480 videos and our test set 749 
videos. These videos are provided at 720p@30fps and 
range from 42 to 1727 frames in length, with an average 
of 435 frames. 

The videos nominally depict 48 distinct verbs as listed in 
Table 1. However, the mapping from videos to verbs is not
one-to-one. Due to polysemy, a verb may describe more 
than one action class, e.g. leaving an object on the table vs. 
leaving the scene. Due to synonymy, an action class may 
be described by more than one verb, e.g. lift vs. raise. An 
event described by one verb may contain a component ac- 
tion described by a different verb, e.g. picking up an object 
vs. touching an object. Many of the events are described 
by the combination of a verb with other constituents, e.g. 
have a conversation vs. have a heart attack. And many 
of the videos depict metaphoric extensions of verbs, e.g. 
take a puff on a cigarette. Because the mapping from 
videos to verbs is subjective, the corpus comes labeled with 
DARPA-collected human judgments in the form of a single 
present/absent label associated with each video paired with 
each of the 48 verbs, gathered using Amazon Mechanical 
Turk. We use these labels for both training and testing as 
described later. 



3 Overall system architecture 

The overall architecture of our system is depicted in Fig. 1.
We first apply detectors (Felzenszwalb et al., 2010a,b) for
each object class on each frame of each video. These
detectors are biased to yield many false positives but few
false negatives. The Kanade-Lucas-Tomasi (KLT) (Shi and
Tomasi, 1994; Tomasi and Kanade, 1991) feature tracker
is then used to project each detection five frames forward
to augment the set of detections and further compensate
for false negatives in the raw detector output. A
dynamic-programming algorithm (Viterbi, 1971) is then used to
select an optimal set of detections that is temporally
coherent with optical flow, yielding a set of object tracks for
each video. These tracks are then smoothed and used to
compute a time-series of feature vectors for each video to
describe the relative and absolute motion of event
participants. The person detections are then clustered based on
part displacements to derive a coarse measure of human
Figure 1: The overall architecture of our system for
producing sentential descriptions of video.



body posture in the form of a body-posture codebook. The
codebook indices of person detections are then added to the
feature vector. Hidden Markov Models (HMMs) are then
employed as time-series classifiers to yield verb labels for
each video (Siskind and Morris, 1996; Starner et al., 1998;
Wang and Mori, 2009; Xu et al., 2002, 2005), together with
the object tracks of the participants in the action described
by that verb along with the roles they play. These tracks are
then processed to produce nouns from object classes,
adjectives from object properties, prepositional phrases from
spatial relations, and adverbs and prepositional-phrase
adjuncts from track properties. Together with the verbs, these
are then woven into grammatical sentences. We describe
each of the components of this system in detail below: the
object detector and tracker in Section 3.1, the body-posture
clustering and codebook in Section 3.2, the event classifier
in Section 3.3, and the sentential-description component in
Section 3.4.

3.1 Object detection and tracking 

We employ detection-based tracking as described in
Section 2 of a parallel submission (id: 568). In detection-based
tracking, an object detector is applied to each frame of a
video to yield a set of candidate detections, which are
composed into tracks by selecting a single candidate detection
from each frame so as to maximize the temporal coherency
of the track. Felzenszwalb et al. detectors are used for this
purpose. Detection-based tracking requires biasing the
detector to have high recall at the expense of low precision to
allow the tracker to select boxes to yield a temporally
coherent track. This is done by depressing the acceptance
thresholds. To prevent massive over-generation of false positives,
which would severely impact run time, we limit the number
of detections produced per frame to 12.
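The selection step can be illustrated with a minimal dynamic-programming sketch. It is not the tracker from the parallel submission: the `coherence` function (negative distance between box centers) is a hypothetical stand-in for the optical-flow coherence term, and the detection dictionaries with `box` and `score` keys are assumed data structures. Every frame is assumed to contain at least one detection.

```python
import math

def top_k(detections, k=12):
    """Keep the k highest-scoring detections of one frame."""
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:k]

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def coherence(prev, cur):
    """Hypothetical coherence term: negative distance between box centers."""
    (px, py), (cx, cy) = center(prev["box"]), center(cur["box"])
    return -math.hypot(cx - px, cy - py)

def viterbi_track(frames, k=12):
    """Pick one detection per frame maximizing total score plus coherence."""
    frames = [top_k(f, k) for f in frames]
    best = [d["score"] for d in frames[0]]   # best path score ending at each detection
    back = []                                # back-pointers, one list per later frame
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for cur in frames[t]:
            scores = [best[i] + coherence(prev, cur)
                      for i, prev in enumerate(frames[t - 1])]
            i_max = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_max] + cur["score"])
            pointers.append(i_max)
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the best final detection.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [frames[t][j] for t, j in enumerate(path)]
```

Because only one detection per frame survives, a high-scoring but spatially erratic false positive loses to a slightly weaker detection that moves smoothly, which is why the detector can afford low precision.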

Two practical issues arise when depressing acceptance
thresholds. First, it is necessary to reduce the degree
of non-maximal suppression incorporated in the
Felzenszwalb et al. detectors. Second, with the star detector
(Felzenszwalb et al., 2010b), one can simply decrease the
single trained acceptance threshold to yield more
detections with no increase in computational complexity.
However, we prefer to use the star cascade detector
(Felzenszwalb et al., 2010a) as it is far faster. With the star
cascade detector, though, one must also decrease the trained
root- and part-filter thresholds to get more detections.
Doing so, however, defeats the computational advantage of the
cascade and significantly increases detection time. We thus
train a model for the star detector using the standard
procedure on human-annotated training data, sample the top
detections produced by this model with a decreased
acceptance threshold, and train a model for the star cascade
detector on these samples. This yields a model that is almost
as fast as one trained by the star cascade detector on the
original training samples but with the desired bias in
acceptance threshold.

The Y1 corpus contains approximately 70 different object
classes that play a role in the depicted events. Many of
these, however, cannot be reliably detected with the
Felzenszwalb et al. detectors that we use. We trained models for
25 object classes that can be reliably detected, as listed in
Table 2. These object classes account for over 90% of the
event participants. Person models were trained with
approximately 2000 human-annotated positive samples from
C-D1 while nonperson models were trained with
approximately 1000 such samples. For each positive training
sample, two negative training samples were randomly
generated from the same frame constrained to not overlap
substantially with the positive samples. We trained three
distinct person models to account for body-posture variation
and pool these when constructing person tracks. The
detection scores were normalized for such pooled detections
by a per-model offset computed as follows. A 50-bin
histogram was computed of the scores of the top detection in
each frame of a video. The offset is then taken to be the
minimum of the value that maximizes the between-class
variance (Otsu, 1979) when bipartitioning this histogram
and the trained acceptance threshold offset by a fixed, but
small, amount (0.4).
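The offset computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the text does not say in which direction the trained threshold is shifted by 0.4, so subtracting the margin is an assumption.

```python
import numpy as np

def otsu_threshold(scores, bins=50):
    """Histogram bin edge that maximizes between-class variance (Otsu, 1979)."""
    hist, edges = np.histogram(scores, bins=bins)
    p = hist / hist.sum()
    mids = (edges[:-1] + edges[1:]) / 2.0
    mt = np.sum(p * mids)                 # global mean score
    # Candidate splits fall after bins 0..bins-2; a split after the last bin
    # would leave the upper class empty.
    w0 = np.cumsum(p)[:-1]                # probability mass of the lower class
    m0 = np.cumsum(p * mids)[:-1]         # unnormalized mean of the lower class
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    var = np.zeros(bins - 1)
    var[valid] = (mt * w0[valid] - m0[valid]) ** 2 / (w0[valid] * w1[valid])
    return float(edges[np.argmax(var) + 1])

def score_offset(top_scores, trained_threshold, margin=0.4, bins=50):
    """Per-model offset: min of the Otsu split and the shifted trained threshold.

    Subtracting `margin` is an assumption; the text only says 'offset by'.
    """
    return min(otsu_threshold(top_scores, bins), trained_threshold - margin)
```

Subtracting the resulting offset from each model's scores would put the three pooled person models on a roughly comparable scale before tracks are constructed.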

We employed detection-based tracking for all 25 object
models on all 749 videos in our test set. To prune the
large number of tracks thus produced, we discard all tracks
corresponding to certain object models on a per-video
basis: those that exhibit high detection-score variance over
the frames in that video as well as those whose
detection-score distributions are neither unimodal nor bimodal. The
parameters governing such pruning were determined solely
on the training set. The tracks that remain after this pruning
still account for over 90% of the event participants.

3.2 Body-posture codebook 

We recognize events using a combination of the motion of 
the event participants and the changing body posture of the 
human participants. Body-posture information is derived 
using the part structure produced as a by-product of the 
Felzenszwalb et al. detectors. While such information is 
far noisier and less accurate than fitting precise articulated
models (Andriluka et al., 2008; Bregler, 1997; Gavrila and
Davis, 1995; Sigal et al., 2010; Yang and Ramanan, 2011)
and appears unintelligible to the human eye, as shown in
Section 3.3, it suffices to improve event-recognition
accuracy. Such information can be extracted from a large
unannotated corpus far more robustly than possible with precise
articulated models.

Body-posture information is derived from part structure
in two ways. First, we compute a vector of part
displacements, each displacement as a vector from the
detection center to the part center, normalizing these vectors
to unit detection-box area. The time-series of feature
vectors is augmented to include these part displacements and
a finite-difference approximation of their temporal
derivatives as continuous features for person detections.
Second, we vector-quantize the part-displacement vector and
include the codebook index as a discrete feature for person
detections. Such pose features are included in the
time-series on a per-frame basis. The codebook is trained by
running each pose-specific person detector on the positive
human-annotated samples used to train that detector and
extracting the resulting part-displacement vectors. We then
pool the part-displacement vectors from the three
pose-specific person models and employ hierarchical k-means
clustering using Euclidean distance to derive a codebook
of 49 clusters. Fig. 2 shows sample clusters from our
codebook. Codebook indices are derived using Euclidean
distance from the means of these clusters.
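A minimal sketch of the two pose features follows. The function names are hypothetical, and dividing by the square root of the box area is one reading of "normalizing to unit detection-box area", so it should be treated as an assumption. The cluster means are taken as given; training them is outside this sketch.

```python
import numpy as np

def part_displacements(det_box, part_centers):
    """Flattened part displacements from the detection center, scaled so the
    detection box has unit area (assumed: divide by sqrt of the box area)."""
    x1, y1, x2, y2 = det_box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    scale = np.sqrt((x2 - x1) * (y2 - y1))
    parts = np.asarray(part_centers, dtype=float)
    return ((parts - center) / scale).ravel()

def codebook_index(displacement, means):
    """Index of the cluster mean nearest in Euclidean distance."""
    distances = np.linalg.norm(np.asarray(means) - displacement, axis=1)
    return int(np.argmin(distances))
```

For a codebook of 49 clusters, `means` would be a 49-row array with one column per displacement coordinate, and the returned index is the discrete feature added to the per-frame feature vector.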

3.3 Event classification 

Our tracker produces one or more tracks per object class 
for each video. We convert such tracks into a time-series of
feature vectors. For each video, one track is taken to des- 
ignate the agent and another track (if present) is taken to 
designate the patient. During training, we manually spec- 
ify the track-to-role mapping. During testing, we automat- 
ically determine the track-to-role mapping by examining 
Table 2: Trained models for object classes and their mappings to (a) nouns, (b) restrictive adjectives, and (c) size adjectives.
Figure 2: Sample clusters from our body-posture codebook.



all possible such mappings and selecting the one with the 
highest likelihood (Siskind and Morris, 1996). 
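The exhaustive search over track-to-role mappings can be sketched as below. The `log_likelihood` callable is a hypothetical stand-in for scoring an ordered (agent, patient) assignment with a trained two-track HMM; nothing about its interface comes from the paper.

```python
from itertools import permutations

def best_role_mapping(tracks, log_likelihood, n_roles=2):
    """Score every ordered assignment of tracks to roles and keep the best.

    Returns (score, assignment), or None when there are fewer tracks than roles.
    """
    best = None
    for assignment in permutations(tracks, n_roles):
        score = log_likelihood(assignment)   # hypothetical HMM scoring hook
        if best is None or score > best[0]:
            best = (score, assignment)
    return best
```

With n tracks and two roles this evaluates n(n-1) assignments, which stays cheap as long as the pruning described in Section 3.1 keeps the number of tracks per video small.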

The feature vector encodes both the motion of the event 
participants and the changing body posture of the human 
participants. For each event participant in isolation we in- 
corporate the following single-track features: 

1. x and y coordinates of the detection-box center

2. detection-box aspect ratio and its temporal derivative 

3. magnitude and direction of the velocity of the 
detection-box center 

4. magnitude and direction of the acceleration of the 
detection-box center 

5. normalized part displacements and their temporal 
derivatives 

6. object class (the object detector yielding the detection) 

7. root-filter index 

8. body-posture codebook index

The last three features are discrete; the remainder are con- 
tinuous. For each pair of event participants we incorporate 
the following track-pair features: 

1. distance between the agent and patient detection-box
centers and its temporal derivative 

2. orientation of the vector from agent detection-box 
center to patient detection-box center 
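The continuous motion features in the two lists above can be approximated with finite differences. The sketch below is illustrative only: the function names and the dictionary layout are assumptions, smoothing is omitted, and the frame rate of 30 fps is taken from the corpus description. Each track is assumed to have at least two frames.

```python
import numpy as np

def single_track_features(centers, fps=30.0):
    """Velocity and acceleration features from (T, 2) detection-box centers."""
    c = np.asarray(centers, dtype=float)
    vel = np.gradient(c, axis=0) * fps        # pixels per second
    acc = np.gradient(vel, axis=0) * fps      # pixels per second squared
    return {
        "position": c,
        "speed": np.hypot(vel[:, 0], vel[:, 1]),
        "heading": np.arctan2(vel[:, 1], vel[:, 0]),
        "accel_magnitude": np.hypot(acc[:, 0], acc[:, 1]),
        "accel_direction": np.arctan2(acc[:, 1], acc[:, 0]),
    }

def track_pair_features(agent_centers, patient_centers, fps=30.0):
    """Agent-to-patient distance, its temporal derivative, and orientation."""
    d = np.asarray(patient_centers, dtype=float) - np.asarray(agent_centers, dtype=float)
    dist = np.hypot(d[:, 0], d[:, 1])
    return {
        "distance": dist,
        "distance_rate": np.gradient(dist) * fps,
        "orientation": np.arctan2(d[:, 1], d[:, 0]),
    }
```

The angular outputs (`heading`, `accel_direction`, `orientation`) are the quantities that call for a circular output distribution, while the remaining ones are linear.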

Our HMMs assume independent output distributions for 
each feature. Discrete features are modeled with discrete 
output distributions. Continuous features denoting linear 
quantities are modeled with univariate Gaussian output dis- 
tributions, while those denoting angular quantities are mod- 
eled with von Mises output distributions. 
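Under the independence assumption, the log output probability of an HMM state is the sum of per-feature log densities. The sketch below shows this for the three distribution families named above; the state representation (a dictionary mapping feature names to a family and its parameters) is a hypothetical choice made for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def von_mises_logpdf(theta, mean, kappa):
    # np.i0 is the modified Bessel function of the first kind, order 0.
    return kappa * np.cos(theta - mean) - np.log(2 * np.pi * np.i0(kappa))

def discrete_logpmf(symbol, probs):
    return np.log(probs[symbol])

LOG_DENSITIES = {
    "gaussian": gaussian_logpdf,
    "von_mises": von_mises_logpdf,
    "discrete": discrete_logpmf,
}

def state_log_output(observation, state):
    """Sum of per-feature log densities, i.e. features treated as independent."""
    return float(sum(LOG_DENSITIES[kind](observation[name], *params)
                     for name, (kind, params) in state.items()))
```

The von Mises density is periodic in `theta`, so headings of just under pi and just over -pi are treated as neighbors, which a Gaussian on the raw angle would not do.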

For each of the 48 action classes, we train two HMMs on
two different sets of time-series of feature vectors, one
containing only single-track features for a single participant
and the other containing single-track features for two
participants along with the track-pair features. A training set
of between 16 and 200 videos was selected manually from
C-D1 for each of these 96 HMMs as positive examples
depicting each of the 48 action classes. A given video could
potentially be included in the training sets for both the
one-track and two-track HMMs for the same action class and
even for HMMs for different action classes, if the video
was deemed to depict both action classes.

During testing, we generate present/absent judgments for 
each video in the test set paired with each of the 48 action 
classes. We do this by thresholding the likelihoods pro- 
duced by the HMMs. By varying these thresholds, we can 
produce an ROC curve for each action class, comparing 
the resulting machine-generated present/absent judgments 
with the Amazon Mechanical Turk judgments. When do- 
ing so, we test videos for which our tracker produces two or 
more tracks against only the two-track HMMs while we test 
ones for which our tracker produces a single track against 
only the one-track HMMs. 
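Sweeping the likelihood threshold to trace an ROC curve for one action class can be sketched as follows. The function name and the input format (one HMM log likelihood and one human present/absent label per test video) are assumptions, and both labels must occur at least once.

```python
def roc_points(log_likelihoods, labels):
    """(false-positive rate, true-positive rate) pairs, one per threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                     # threshold above every likelihood
    for threshold in sorted(set(log_likelihoods), reverse=True):
        predicted = [ll >= threshold for ll in log_likelihoods]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        points.append((fp / negatives, tp / positives))
    return points
```

Each distinct likelihood value is used once as a threshold, so the curve has at most one point per test video plus the origin.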

We performed three experiments, training 96 different
200-state HMMs for each. Experiment I omitted all discrete
features and all body-posture related features.
Experiment II omitted only the discrete features. Experiment III
omitted only the continuous body-posture related features.
ROC curves for each experiment are shown in Fig. 3, Fig. 4,
and Fig. 5. Note that the incorporation of body-posture
information, either in the form of continuous normalized
part displacements or discrete codebook indices, improves
event-recognition accuracy, despite the fact that the part
displacements produced by the Felzenszwalb et al.
detectors are noisy and appear unintelligible to the human eye.
Figure 3: ROC curves for each of the 48 action classes for Experiment I omitting all discrete and body-posture-related
features.
Figure 4: ROC curves for each of the 48 action classes for Experiment II omitting only the discrete features.
Figure 5: ROC curves for each of the 48 action classes for Experiment III omitting only the continuous body-posture-related
features.
3.4 Generating sentences

We produce a sentence from a detected action class
together with the associated tracks using the templates from
Table 3. In these templates, words in italics denote fixed
strings, words in bold indicate the action class, X and Y
denote subject and object noun phrases, and the categories
Adv, PPendo, and PPexo denote adverbs and
prepositional-phrase adjuncts to describe the subject motion. The
processes for generating these noun phrases, adverbs, and
prepositional-phrase adjuncts are described below.
One-track HMMs take that track to be the agent and thus the
subject. For two-track HMMs we choose the mapping from
tracks to roles that yields the higher likelihood and take the
agent track to be the subject and the patient track to be the
object except when the action class is either approached or
fled, the agent is (mostly) stationary, and the patient moves
more than the agent.

Brackets in the templates denote optional entities. Op- 
tional entities containing Y are generated only for two- 
track HMMs. The criteria for generating optional ad- 
verbs and prepositional phrases are described below. The 
optional entity for received is generated when there is 
a patient track whose category is mailbox, person, 
person-crouch, or person-down. 

We use adverbs to describe the velocity of the subject. For 
some verbs, a velocity adverb would be awkward: 



*X slowly had Y            *X had slowly Y



Furthermore, stylistic considerations dictate the syntactic 
position of an optional adverb: 



X jumped slowly over Y     X slowly jumped over Y
X slowly approached Y      *X approached slowly Y
?X slowly fell             X fell slowly



The verb-phrase templates thus indicate whether an adverb
is allowed, and if so whether it occurs, preferentially,
preverbally or postverbally. Adverbs are chosen subject to
three thresholds $v_1^{\text{action class}}$, $v_2^{\text{action class}}$,
and $v_3^{\text{action class}}$ determined empirically on a
per-action-class basis. We select those frames from the subject
track where the magnitude of the velocity of the
box-detection center is above $v_1^{\text{action class}}$.
An optional adverb is generated by comparing the magnitude
of the average velocity $v$ of the subject track
box-detection centers in these frames to the per-action-class
thresholds:

quickly    $v > v_2^{\text{action class}}$
slowly     $v_1^{\text{action class}} < v < v_3^{\text{action class}}$
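The adverb rule can be sketched as below. The roles assigned to the three thresholds are an assumption based on one reading of the damaged formulas in this passage: `v1` selects the moving frames and is the lower bound for slowly, `v2` is the lower bound for quickly, and `v3` is the upper bound for slowly. The function names are hypothetical.

```python
import math

def mean_moving_velocity(velocities, v1):
    """Average velocity vector over frames whose speed exceeds v1."""
    moving = [(vx, vy) for vx, vy in velocities if math.hypot(vx, vy) > v1]
    if not moving:
        return None
    n = len(moving)
    return (sum(vx for vx, _ in moving) / n, sum(vy for _, vy in moving) / n)

def choose_adverb(velocities, v1, v2, v3):
    """Return 'quickly', 'slowly', or None (assumed threshold roles, see above)."""
    mean = mean_moving_velocity(velocities, v1)
    if mean is None:
        return None
    v = math.hypot(*mean)          # magnitude of the average velocity
    if v > v2:
        return "quickly"
    if v1 < v < v3:
        return "slowly"
    return None
```

Because the vectors are averaged before the magnitude is taken, a subject that moves back and forth can have a small average velocity even though every selected frame is fast, in which case no adverb is produced.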



We use prepositional-phrase adjuncts to describe the mo- 
tion direction of the subject. Again, for some verbs, such 
adjuncts would be awkward: 



*X had Y leftward          *X had Y from the left



Moreover, for some verbs it is natural to describe the mo- 
tion direction endogenously, from the perspective of the 



subject, while for others it is more natural to describe the 
motion direction exogenously, from the perspective of the 
viewer: 



X fell leftward            X fell from the left
X chased Y leftward        *X chased Y from the left
*X arrived leftward        X arrived from the left



The verb-phrase templates thus indicate whether an adjunct
is allowed, and if so whether it is preferentially endogenous
or exogenous. The choice of adjunct is determined from
the orientation of $v$, as computed above and depicted in
Fig. 6(a,b). We omit the adjunct when $v < v_1^{\text{action class}}$.
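Choosing the adjunct from the orientation of the average velocity amounts to binning an angle into eight sectors. The sketch below assumes equal 45-degree sectors, image coordinates with y pointing down, and that an exogenous phrase names the side the subject comes from, i.e. the side opposite to its motion. None of these details are stated in the text, so all three are assumptions.

```python
import math

# Indexed by sector, counterclockwise from motion toward the right of the image.
ENDOGENOUS = [
    "rightward", "rightward and upward", "upward", "leftward and upward",
    "leftward", "leftward and downward", "downward", "rightward and downward",
]
# Assumed: the exogenous phrase names the origin opposite to the motion.
EXOGENOUS = [
    "from the left", "from below and to the left", "from below",
    "from below and to the right", "from the right",
    "from above and to the right", "from above", "from above and to the left",
]

def direction_adjunct(vx, vy, exogenous=False):
    """Map an image-space velocity (y down) to one of eight direction phrases."""
    angle = math.atan2(-vy, vx)                     # flip y so that up is positive
    sector = round(angle / (math.pi / 4)) % 8
    return (EXOGENOUS if exogenous else ENDOGENOUS)[sector]
```

For example, a subject moving toward the right edge of the frame yields "rightward" for an endogenous template and "from the left" for an exogenous one.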

We generate noun phrases X and Y to refer to event partic- 
ipants according to the following grammar: 

NP → themselves | itself | something | D A* N [PP]
D → the | that | some

When instantiating a sentential template that has a required 
object noun-phrase Y for a one-track HMM, we generate a 
pronoun. A pronoun is also generated when the action class 
is entered or exited and the patient class is not car, door, 
suv, or truck. The anaphor themselves is generated if the
action class is attached or raised, the anaphor itself if the 
action class is moved, and something otherwise. 

As described below, we generate an optional prepositional 
phrase for the subject noun phrase to describe the spatial 
relation between the subject and the object. We choose the 
determiner to handle coreference, generating the when a 
noun phrase unambiguously refers to the agent or the pa- 
tient due to the combination of head noun and any adjec- 
tives, 

The person jumped over the ball. 

The red ball collided with the blue ball.

that for an object noun phrase that corefers to a track re- 
ferred to in a prepositional phrase for the subject, 

The person to the right of the car approached that car. 
Some person to the right of some other person ap- 
proached that other person. 

and some otherwise: 

Some car approached some other car. 
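The determiner choice and the D A* N [PP] expansion can be sketched as follows. The two boolean inputs are hypothetical summaries of the coreference analysis, and checking coreference with the subject's prepositional phrase before unambiguity is an assumption suggested by the example "approached that car".

```python
def choose_determiner(unambiguous, corefers_with_subject_pp):
    """Pick 'that', 'the', or 'some' for a noun phrase (assumed precedence)."""
    if corefers_with_subject_pp:
        return "that"
    return "the" if unambiguous else "some"

def noun_phrase(determiner, adjectives, noun, pp=None):
    """Assemble D A* N [PP] as a space-separated string."""
    words = [determiner, *adjectives, noun]
    if pp:
        words.append(pp)
    return " ".join(words)
```

For instance, `noun_phrase("some", [], "person", "to the right of some other person")` builds the subject of the second example above, and `choose_determiner(False, True)` supplies "that" for its object.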

We generate the head noun of a noun phrase from the
object class using the mapping in Table 2(a). Four different
kinds of adjectives are generated: color, shape, size, and
restrictive modifiers. An optional color adjective is
generated based on the average HSV values in the eroded
detection boxes for a track: black when V < 0.2, white when
V > 0.8, and one of red, blue, green, yellow, teal, or pink,
based on H, when S > 0.7. An optional size adjective is
generated in two ways, one from the object class using the
mapping in Table 2(c), the other based on per-object-class
image statistics. For each object class, a mean object size
X [Adv] approached Y [PPexo]
X arrived [Adv] [PPexo]
X [Adv] attached an object to Y
X bounced [Adv] [PPendo]
X buried Y
X [Adv] carried Y [PPendo]
X caught Y [PPexo]
X [Adv] chased Y [PPendo]
X closed Y
X [Adv] collided with Y [PPexo]
X was digging [with Y]
X dropped Y
X [Adv] entered Y [PPendo]
X [Adv] exchanged an object with Y
X [Adv] exited Y [PPendo]
X fell [Adv] [because of Y] [PPendo]
X fled [Adv] [from Y] [PPendo]
X flew [Adv] [PPendo]
X [Adv] followed Y [PPendo]
X got an object from Y
X gave an object to Y
X went [Adv] away [PPendo]
X handed Y an object
X [Adv] hauled Y [PPendo]
X had Y
X hit [something with] Y
X held Y
X jumped [Adv] [over Y]
X [Adv] kicked Y [PPendo]
X left [Adv] [PPendo]
X [Adv] lifted Y
X [Adv] moved Y [PPendo]
X opened Y
X [Adv] passed Y [PPexo]
X picked Y up
X [Adv] pushed Y [PPendo]
X put Y down
X raised Y

X received [an object from] Y
PPendo] X [Adv] replaced Y 

X ran [Adv] [to Y] [PPendo]
X [Adv] snatched an object from Y
X [Adv] stopped [Y]
X [Adv] took an object from Y
X [Adv] threw Y [PPendo]
X touched Y
X turned [PPendo]
X walked [Adv] [to Y] [PPendo]

Table 3: Sentential templates for the action classes indicated in bold.



(a)
leftward and upward      upward      rightward and upward
leftward                             rightward
leftward and downward    downward    rightward and downward

(b)
from above and to the left    from above    from above and to the right
from the left                               from the right
from below and to the left    from below    from below and to the right

(c)
above and to the left of Y    above Y    above and to the right of Y
to the left of Y                         to the right of Y
below and to the left of Y    below Y    below and to the right of Y

Figure 6: (a) Endogenous and (b) exogenous prepositional-phrase adjuncts to describe subject motion direction. (c)
Prepositional phrases incorporated into subject noun phrases describing viewer-relative 2D spatial relations between the subject
X and the reference object Y.



$\bar{a}_{\text{object class}}$ is determined by averaging the
detected-box areas over all tracks for that object class in
the training set used to train HMMs. An optional size
adjective for a track is generated by comparing the average
detected-box area $a$ for that track to
$\bar{a}_{\text{object class}}$:

big     $a \geq \beta_{\text{object class}}\,\bar{a}_{\text{object class}}$
small   $a \leq \alpha_{\text{object class}}\,\bar{a}_{\text{object class}}$

The per-object-class cutoff ratios $\alpha_{\text{object class}}$
and $\beta_{\text{object class}}$ are computed to equally
tripartition the distribution of per-object-class mean object
sizes on the training set. Optional shape adjectives are
generated in a similar fashion. Per-object-class mean aspect
ratios $\bar{r}_{\text{object class}}$ are determined in
addition to the per-object-class mean object sizes
$\bar{a}_{\text{object class}}$. Optional shape adjectives for
a track are generated by comparing the average detected-box
aspect ratio $r$ and area $a$ for that track to these means:

tall    $r \leq 0.7\,\bar{r}_{\text{object class}} \wedge a \geq \beta_{\text{object class}}\,\bar{a}_{\text{object class}}$
short   $r \geq 1.3\,\bar{r}_{\text{object class}} \wedge a \leq \alpha_{\text{object class}}\,\bar{a}_{\text{object class}}$
narrow  $r \leq 0.7\,\bar{r}_{\text{object class}} \wedge a \leq \alpha_{\text{object class}}\,\bar{a}_{\text{object class}}$
wide    $r \geq 1.3\,\bar{r}_{\text{object class}} \wedge a \geq \beta_{\text{object class}}\,\bar{a}_{\text{object class}}$
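
The size and shape rules above can be sketched in a few lines
of code. This is a minimal illustration under stated
assumptions, not the authors' code: the function names are
hypothetical, the inequalities are taken as non-strict, the
aspect ratio is assumed to be width over height (which is
what the tall and narrow rules suggest), and the cutoff
ratios are assumed to be the tertiles of each track's area
divided by the class mean area.

```python
import statistics


def cutoff_ratios(track_areas):
    """Return (alpha, beta) that split per-track mean areas into thirds.

    Assumption: the ratios are tertiles of area / class-mean-area.
    """
    mean_area = statistics.fmean(track_areas)
    ratios = sorted(area / mean_area for area in track_areas)
    alpha, beta = statistics.quantiles(ratios, n=3, method="inclusive")
    return alpha, beta


def size_adjective(a, mean_a, alpha, beta):
    """Return "big", "small", or None for a track with mean box area a."""
    if a >= beta * mean_a:
        return "big"
    if a <= alpha * mean_a:
        return "small"
    return None


def shape_adjective(a, r, mean_a, mean_r, alpha, beta):
    """Return "tall", "short", "narrow", "wide", or None.

    r is the track's mean aspect ratio (assumed width / height).
    """
    low_ratio = r <= 0.7 * mean_r
    high_ratio = r >= 1.3 * mean_r
    big = a >= beta * mean_a
    small = a <= alpha * mean_a
    if low_ratio and big:
        return "tall"
    if high_ratio and small:
        return "short"
    if low_ratio and small:
        return "narrow"
    if high_ratio and big:
        return "wide"
    return None


# Toy training areas for one object class (made-up numbers).
alpha, beta = cutoff_ratios([10, 20, 30, 40, 50, 60, 70])
print(alpha, beta)
print(size_adjective(60, 40, alpha, beta))
print(shape_adjective(60, 0.5, 40, 1.0, alpha, beta))
```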

To avoid generating shape and size adjectives for unstable
tracks, they are only generated when the detection-score
variance and the detected aspect-ratio variance for the track
are below specified thresholds. Optional restrictive
modifiers are generated from the object class using the
mapping in Table 2(b). Person-pose adjectives are generated
from aggregate body-posture information for the track:
object class, normalized part displacements, and
body-posture codebook indices. We generate all applicable
adjectives except for color and person pose. Following the
Gricean Maxim of Quantity (Grice, 1975), we only generate
color and person-pose adjectives if needed to prevent
coreference of nonhuman event participants. Finally, we
generate an initial adjective "other", as needed to prevent
coreference. Generating "other" does not allow generation
of the determiner "the" in place of "that" or "some". We
order any adjectives generated so that "other" comes first,
followed by size, shape, color, and restrictive modifiers,
in that order.
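
The ordering rule stated above amounts to a sort by adjective
category. The following sketch illustrates it; the category
labels and the function name `order_adjectives` are
assumptions for this example, and person-pose adjectives are
left out because the stated ordering does not place them.

```python
# Category order as stated in the text: other, size, shape, color, restrictive.
CATEGORY_ORDER = ["other", "size", "shape", "color", "restrictive"]


def order_adjectives(adjectives):
    """Order (category, word) pairs by the stated category order.

    sorted() is stable, so adjectives within one category keep their
    input order.
    """
    ranked = sorted(adjectives, key=lambda pair: CATEGORY_ORDER.index(pair[0]))
    return [word for _, word in ranked]


ordered = order_adjectives(
    [("color", "red"), ("restrictive", "toy"), ("other", "other"), ("size", "big")]
)
print(" ".join(ordered))
```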

For two-track HMMs where neither participant moves, a
prepositional phrase is generated for subject noun phrases
to describe the static 2D spatial relation between the subject
X and the reference object Y from the perspective of the
viewer, as shown in Fig. 6(c).
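
One plausible way to choose among the Fig. 6(c) phrases is to
compare the centers of the two detected boxes. The sketch
below does this. The text shown here does not describe the
actual decision rule, so the function name
`spatial_relation`, the `tolerance` parameter, and the use of
box centers in image coordinates (y increasing downward) are
assumptions.

```python
def spatial_relation(subject_center, reference_center, reference_np, tolerance=10.0):
    """Return a viewer-relative phrase such as "above and to the left of Y".

    subject_center, reference_center: (x, y) box centers in pixels
        (assumption: y grows downward, as in image coordinates).
    reference_np: noun phrase for the reference object Y.
    tolerance: hypothetical offset below which an axis is ignored.
    """
    dx = subject_center[0] - reference_center[0]
    dy = subject_center[1] - reference_center[1]

    vertical = "above" if dy < -tolerance else "below" if dy > tolerance else ""
    horizontal = (
        "to the left of" if dx < -tolerance
        else "to the right of" if dx > tolerance
        else ""
    )

    if vertical and horizontal:
        return f"{vertical} and {horizontal} {reference_np}"
    if vertical:
        return f"{vertical} {reference_np}"
    if horizontal:
        return f"{horizontal} {reference_np}"
    return ""  # the boxes are roughly coincident; no phrase is produced


print(spatial_relation((100, 50), (200, 150), "the table"))
print(spatial_relation((300, 152), (200, 150), "the chair"))
```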

4 Experimental results 

We used the HMMs generated for Experiment III to compute
likelihoods for each video in our test set paired with each
of the 48 action classes. For each video, we generated
sentences corresponding to the three most-likely action
classes. Fig. 7 shows key frames from four videos in our
test set along with the sentence generated for the
most-likely action class. Human judges rated each
video-sentence pair to assess whether the sentence was true
of the video and whether it described a salient event
depicted in that video. 26.7% (601/2247) of the
video-sentence pairs were deemed to be true and 7.9%
(178/2247) of the video-sentence pairs were deemed to be
salient. When restricting consideration to only the sentence
corresponding to the single most-likely action class for
each video, 25.5% (191/749) of the video-sentence pairs were
deemed to be true and 8.4% (63/749) of the video-sentence
pairs were deemed to be salient. Finally, for 49.4%
(370/749) of the videos at least one of the three generated
sentences was deemed true and for 18.4% (138/749) of the
videos at least one of the three generated sentences was
deemed salient.
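
The reported percentages follow directly from the counts:
749 test videos with three sentences each give 2247
video-sentence pairs. The short check below recomputes each
figure from the counts quoted above. The dictionary keys are
descriptive labels chosen for this example.

```python
NUM_VIDEOS = 749
SENTENCES_PER_VIDEO = 3
NUM_PAIRS = NUM_VIDEOS * SENTENCES_PER_VIDEO  # 2247 video-sentence pairs

# (numerator, denominator, percentage reported in the text)
reported = {
    "true, all three sentences": (601, NUM_PAIRS, 26.7),
    "salient, all three sentences": (178, NUM_PAIRS, 7.9),
    "true, most-likely sentence": (191, NUM_VIDEOS, 25.5),
    "salient, most-likely sentence": (63, NUM_VIDEOS, 8.4),
    "videos with at least one true sentence": (370, NUM_VIDEOS, 49.4),
    "videos with at least one salient sentence": (138, NUM_VIDEOS, 18.4),
}


def percent(numerator, denominator):
    """Percentage rounded to one decimal place, as in the text."""
    return round(100 * numerator / denominator, 1)


for label, (numerator, denominator, stated) in reported.items():
    computed = percent(numerator, denominator)
    print(f"{label}: {numerator}/{denominator} = {computed}% (stated {stated}%)")
```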

5 Conclusion 



Integration of Language and Vision (Aloimonos et al., 2011;
Barzialy et al., 2003; Darrell et al., 2011; McKevitt, 1994,
1995-1996) and recognition of action in video (Blank et al.,
2005; Laptev et al., 2008; Liu et al., 2009; Rodriguez et al.,
2008; Schuldt et al., 2004; Siskind and Morris, 1996; Starner
et al., 1998; Wang and Mori, 2009; Xu et al., 2002, 2005) have
been of considerable interest for a long time. There has also
been work on generating sentential descriptions of static
images (Farhadi et al., 2009; Kulkarni et al., 2011; Yao et
al., 2010). Yet we are unaware of any prior work that
generates as rich sentential video descriptions as we
describe here. Producing such rich descriptions requires
determining event participants, the mapping of such
participants to roles in the event, and their motion and
properties. This is incompatible with common approaches to
event recognition, such as spatiotemporal bags of words,
spatiotemporal volumes, and tracked feature points, which
cannot determine such information. The approach presented
here recovers the information needed to generate rich
sentential descriptions by using detection-based tracking
and a body-posture codebook. We demonstrated the efficacy of
this approach on a corpus of 749 videos.